We will look at some of the ways R can display information graphically, after dealing with our dataset on R session. With the several innated functions and packages on R, you could draw several different and elaborated plots as you want.
Here, I will introduce some basic plotting commands with built-in dataset, called airquality. In other words, we do not need any data import in this time.
A strip chart is the most basic type of plot available. It plots the data in order along a line with each data point represented as a box. Since we have imported data, we will make a strip chart for ozone reading.
stripchart(airquality$Ozone)
We can see that the data is mostly cluttered below 50 with one falling outside 150. However, we could pass some additional parameters to control the way how plot looks like. You can read about them in the help section by using question mark.
?stripchart
Some of the frequently used ones are, main to give the title, xlab and ylab to provide labels for the axes, method to specify the way coincident points are plotted like stacked or jitter, col to define color etc. Additionally, with the argument vertical=TRUE we can plot it vertically and with pch we can specify the plotting character (square by default). Some values of pch are 0 for square, 1 for circle, 2 for triangle etc. You can see the full list in the help section ?points.
What if we try to use several parameters? Let’s try to setting the jitter for the method and giving details.
stripchart(airquality$Ozone,
main="Mean ozone in parts per billion at Roosevelt Island",
xlab="Parts Per Billion",
ylab="Ozone",
method="jitter",
col="orange",
pch=1
)
We can draw multiple strip charts in a single plot, by passing in a list of numeric vectors.
Let us consider the Temp field of airquality dataset. Let us also generate normal distribution with the same mean and standard deviation and plot them side by side for comparison.
# prepare the data
temp <- airquality$Temp
# gererate normal distribution with same mean and sd
tempNorm <- rnorm(200,mean=mean(temp, na.rm=TRUE), sd = sd(temp, na.rm=TRUE))
# make a list
x <- list("temp"=temp, "norm"=tempNorm)
Now we us make two stripcharts with this list in one chart.
stripchart(x,
main="Multiple stripchart for comparision",
xlab="Degree Fahrenheit",
ylab="Temperature",
method="jitter",
col=c("orange","red"),
pch=16
)
Stripchart function also takes in formulas of the form y ~ x where, y is a numeric vector which is grouped according to the value of x. For example, in our dataset airquality, the Temp can be our numeric vector. Month can be our grouping variable, so that we get the strip chart for each month separately. In airquality dataset, month is in the form of number (1=January, 2-Febuary and so on).
stripchart(Temp~Month,
data=airquality,
main="Different strip chart for each month",
xlab="Months",
ylab="Temperature",
col="brown3",
group.names=c("May","June","July","August","September"),
vertical=TRUE,
pch=16
)
A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set, by taking in any number of numeric vectors.
Let’s start with the basic boxplot with Ozone readings from airquality data.
boxplot(airquality$Ozone)
We can see that data above the median is more dispersed. We can also notice two outliers at the higher extreme. Like strip chart, we could adjust parameters with same parameters we used.
boxplot(airquality$Ozone,
main = "Mean ozone in parts per billion at Roosevelt Island",
xlab = "Parts Per Billion",
ylab = "Ozone",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
What if we would like to return the values plotted on boxplot? Here, we could get some information regarding the values on the boxplot by giving the variable to the plot.
b
$stats
[,1]
[1,] 1.0
[2,] 18.0
[3,] 31.5
[4,] 63.5
[5,] 122.0
$n
[1] 116
$conf
[,1]
[1,] 24.82518
[2,] 38.17482
$out
[1] 135 168
$group
[1] 1 1
$names
[1] "1"
As we could see above, a list is returned. - stats: position of the upper/lower extremes of the whiskers and box along the median - n: the number of observations the boxplot is drawn with (NA value is not counted) - conf: upper/lower extremes of the notch - out: value of the outliers - group: a vector of the same length as out whose elements indicate to which group the outlier belongs - names: a vector of names for the groups
We can draw multiple boxplots in a single plot, by passing in a list, data frame or multiple vectors, as we did on strip chart. Let us consider the Ozone and Temp field of airquality dataset. Let us also generate normal distribution with the same mean and standard deviation and plot them side by side for comparison.
# prepare the data
ozone <- airquality$Ozone
temp <- airquality$Temp
# gererate normal distribution with same mean and sd
ozone_norm <- rnorm(200,mean=mean(ozone, na.rm=TRUE), sd=sd(ozone, na.rm=TRUE))
temp_norm <- rnorm(200,mean=mean(temp, na.rm=TRUE), sd=sd(temp, na.rm=TRUE))
Now we us make 4 boxplots with this data. We use the arguments at and names to denote the place and label.
boxplot(ozone, ozone_norm, temp, temp_norm,
main = "Multiple boxplots for comparision",
at = c(1,2,4,5),
names = c("ozone", "normal", "temp", "normal"),
las = 2,
col = c("orange","red"),
border = "brown",
horizontal = TRUE,
notch = TRUE
)
boxplot(Temp~Month,
data=airquality,
main="Different boxplots for each month",
xlab="Month Number",
ylab="Degree Fahrenheit",
col="orange",
border="brown"
)
A histogram is very common plot. It plots the frequencies that data appears within certain ranges. On R, histogram can be created using the hist() function. This function takes in a vector of values for which the histogram is plotted.
Temperature <- airquality$Temp
hist(Temperature)
We can see above that there are 9 cells with equally spaced breaks. In this case, the height of a cell is equal to the number of observation falling in that cell.
Still the parameters are same as the previous two plots, there is several additional parameters to give. For instance, col to define color and the argument freq=F to get the probability distribution rather than frequency.
# histogram with added parameters
hist(Temperature,
main="Maximum daily temperature at La Guardia Airport",
xlab="Temperature in degrees Fahrenheit",
xlim=c(50,100),
col="darkmagenta",
freq=FALSE
)
Since we set freq=F, y-axis label is density, not frequency. In this case, the total area of the histogram is 1.
h
$breaks
[1] 55 60 65 70 75 80 85 90 95 100
$counts
[1] 8 10 15 19 33 34 20 12 2
$density
[1] 0.010457516 0.013071895 0.019607843 0.024836601 0.043137255 0.044444444 0.026143791 0.015686275 0.002614379
$mids
[1] 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5 97.5
$xname
[1] "Temperature"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Although the strategy is same, but the result is a little different from previous. The meaning of each class is - breaks: places where the breaks occur - counts: the number of observations falling in the cell - density: the density of cells - mids: the midpoints of cells - xname: the x argument name - equidist: a logical value indicating if the breaks are equally spaced
h <- hist(Temperature,ylim=c(0,40))
text(h$mids,h$counts,labels=h$counts, adj=c(0.5, -0.5))
With the break argument, we could specify the number of cells we want in the histogram. If we do not give value for break, R calculates the best number of cells, keeping this suggestion in mind. Following are two histograms on the same data with different number of cells.
par(mfrow=c(1,2))
hist(Temperature, breaks=4, main="With breaks=4")
hist(Temperature, breaks=20, main="With breaks=20")
On the above figure we see that the actual number of cells plotted is greater than we had specified.
We can also define breakpoints between the cells as a vector. This makes it possible to plot a histogram with unequal intervals. In such case, the area of the cell is proportional to the number of observations falling inside that cell.
If we would like to give different cell width, what we could set is break but ot uniform interval.
hist(Temperature,
main="Maximum daily temperature at La Guardia Airport",
xlab="Temperature in degrees Fahrenheit",
xlim=c(50,100),
col="chocolate",
border="brown",
breaks=c(55,60,70,75,80,100)
)
A scatter plot provides a graphical view of the relationship between two sets of numbers.
It is always common that the parameters could be adjusted. The parameters are same as strip chart.
plot(airquality$Temp, airquality$Ozone, col="red", pch =19, main="Relationship between Ozone reading and temperature")
Here, pch option is corresponding to the shape of point.
?points
plot(airquality$Temp, airquality$Ozone, col="red", pch =4, main="Relationship between Ozone reading and temperature")
Since scatter plot is used to visualize the relationship between two variables, it is not proper to plot several variables in one plot. However, by creating a matrix of scatterplots, we could see the relationship between several variables.
The last plot we would try is pie chart. It is drawn using the pie function in R programming . This function takes in a vector of non-negative numbers. Here, since the airquality dataset is not proper for plotting pie chart, we will generate some dataset by ourselves.
slices <- c(10, 12,4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")
It is kind of basic form, however, it is a little hard to get the ratio on the plot. What if we could add the percentage value with the annotation?
To add the percentage as annotation, what we should do is calculate the percentage and creat the labels for the value. The total percentage should be 100%
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Countries")
What if we could manage 3D pie chart which we could create on Excel? The pie3D( ) function in the plotrix package provides 3D exploded pie charts.
library(plotrix)
pie3D(slices,labels=lbls,explode=0.1, main="Pie Chart of Countries ")
Since we are already getting familiar with the basic plotting functions via several practices from previous examples. A variety of different subjects ranging from plotting options to the formatting of plot is given.
In this section, we will try to generate plots according to the various distributions. In this time, we will generate several distributions with random numbers.
x <- data$X
y <- data$Y
Sometimes, if we are working with replicates, it would generate a set of data points. In this case, we could add error bars Tby using the arrows command.
plot(x,y,xlab="drug X",ylab="drug Y",main="Random Stuff")
xHigh <- x
yHigh <- y + std.error(y)
xLow <- x
yLow <- y - std.error(y)
arrows(xHigh,yHigh,xLow,yLow,col=2,angle=90,length=0.1,code=3)
However, it seems that the under error bar of the smallest X value and upper error bar of the lagest X value is not showed properly. Therefore, we should adjust the range of Y axis. In this case, ylim option is the key!
plot(x,y,xlab="drug X",ylab="drug Y",main="Random Stuff", ylim=c(15, 50))
xHigh <- x
yHigh <- y + std.error(y)
xLow <- x
yLow <- y - std.error(y)
arrows(xHigh,yHigh,xLow,yLow,col=2,angle=90,length=0.1,code=3)
Note that the option code is used to specify where the bars are drawn. Its value can be 1, 2, or 3. If code is 1 the bars are drawn at pairs given in the first argument. If code is 2 the bars are drawn at the pairs given in the second argument. If code is 3 the bars are drawn at both.
Here, to put several plots on one session, we need to use par function. Tghis command can be used to set different parameters. In the example above the mfrow was also set. The mfrow is composed of a vector with two entries, the first is the number of rows and second is number of columns. By setting mfrow, the plots would be arranged as we want.
drugx <- data$X
drugy <- data$Y
par(mfrow=c(2,3))
boxplot(drugx,main="first plot")
boxplot(drugy,main="second plot")
plot(jitter(drugx),jitter(drugy),xlab="Number White Marbles Drawn",
ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")
hist(drugx,main="fourth plot")
hist(drugy,main="fifth plot")
mosaicplot(table(drugx,drugy),main="sixth plot")
The density plot is not commonly used, but could be a nice plot if we would like to show the density of signals or frequencies. This can be done using the smoothScatter command.
smoothScatter(drugx,drugy,
xlab="Drug X",ylab="Drug Y",main="Drawing Marbles")
Note that the density plot may benefit by superimposing a grid to help delimit the points of interest. This can be done using the grid command.
smoothScatter(drugx,drugy,
xlab="Drug X",ylab="Drug Y",main="Drawing Marbles")
grid(4,3)
A large number of relationships can be plotted at one time using the pairs command. This is helpful when you would like to give the relations as a matrix or a data frame.
In this case, we will use random numbers created by R.
uData <- rnorm(20)
vData <- rnorm(20,mean=5)
wData <- uData + 2*vData + rnorm(20,sd=0.5)
xData <- -2*uData+rnorm(20,sd=0.1)
yData <- 3*vData+rnorm(20,sd=2.5)
d <- data.frame(u=uData,v=vData,w=wData,x=xData,y=yData)
pairs(d)
Since we already practiced how to use the session and package, and basic plotting with innate functions of R. However, we usually work for Next-Generation Sequencing (NGS) data and it requires different type of plotting. The major plots we use for RNA-seq quantification is heatmap, clustering and PCA plot. Here, we will try two different type of plotting with example data.
In example data, there are five different samples from all different origin of cells. There is no duplicate in this case, neither the same cell type. Therefore, what we expect is different pattern of clustering and heatmap pattern.
The data is already normalized into count per 20 million, so we do not have to do any normalization or transformation. I highly recommend to use the normalized data since it is easier to deal with, rather than scaling on R session if you are not familiar with R.
Let’s load the data on the session.
count <- read.delim("gene_exp_cp20m.csv", h=T, sep=",")
head(count)
If we would like to check the number of genes in the dataset, the function dim will show the information.
dim(count)
[1] 49671 7
There are seven different columns and the total number of genes is 49671.
Since the matrix should have the gene expression values only with the rownames, what we will do is creating proper format of gene expression matrix. Sometimes the same genes with same geneID could be included, we should check it by unique function. It is not mandatory, but recommended.
count <- unique(count)
dsample <- count[,3:ncol(count)]
rownames(dsample) <- count[,1]
head(dsample)
When we create the proper format of matrix, the rowname is geneID, not gene symbol. Because reference genome includes the isoforms, their gene symbol is same but the gene ID is totally different. R takes the rowname with unique values only, therefore we could not use gene symbol when indicating the gene information as row name.
Before we start plotting, what we should do is filtering out the genes which is not proper to be considered as expressed. Here, we will filter out if the sum of gene expression value is less than 40.
keep <- rowSums(dsample) > 40
dsample <- dsample[keep,]
dim(dsample)
[1] 20960 5
After filtration, there is 20960 genes left. We will start to check the clustering and heatmap with this filtered one.
The first one we would like to try is clustering. Here, we will try two different type of clustering, unsupervised clustering and correlation-based. The result of clustering plot is dendrogram, composed of several trees between samples.
Unsupervised clustering is a type of self-organized Hebbian learning that helps find previously unknown patterns in data set without pre-existing labels. Unsupervised clustering is one of the hierarchical clustering methods, building a hierarchy of clusters. In this example, the distance is euclidean distance, the straight distance between two points.
The 2D format of unsupervised clustering only has branches between the samples. This is the simplest one which we could use for RNA-seq data. Also, we could draw the unsupervised clustering results with all the genes in the data or selection some most variable genes.
hclust function is used for hierarchical cluster analysis on a set of dissimilarities and a representative methods for analyzing it. With this function, we could generate the unsupervised clustering results to plot on the session.
hs <- hclust(dist(t(dsample)))
plot(hs)
The height is the euclidean distance between samples. The higher height is, the more different between each other.
Since we would like to see how the samples are variable, what we could try to do is extracting a certain number of most variable genes and then draw dendrogram. The strategy is totally same as the one with whole dataset, but the step to calculate and extract variable genes are added.
varM <- apply(dsample, 1, var) #Calculating variances between samples
varM1 <- varM[order(varM, decreasing=TRUE)] # Order the calculated variances
Here, we will use only the 500 most variable genes, but you could adjust number of genes by changing the range of variance data.
varM1 <- varM1[1:500]
dM <- dsample[names(varM1),]
plot(hclust(dist(t(dM))))
A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. It is useful when comparing the gene expression data from different conditions.
Not like clustering, heatmap requires several libraries and color determination. We will prepare the pre-requisite before starting plotting.
For the libraries, the three libraries are required - RColorBrewer, gplots and ggplot2. If they are not installed yet, you could install them with this command.
install.packages("RColorBrewer")
install.packages("gplots")
install.packages("ggplot2")
library(RColorBrewer)
library(gplots)
Attaching package: ‘gplots’
The following object is masked from ‘package:plotrix’:
plotCI
The following object is masked from ‘package:stats’:
lowess
library(ggplot2)
Also, color determination is based on the setting palettes. You could give the color keys what you want, but here, we will try Red-Yellow-Blue scheme.
mypalette <- brewer.pal(11, "RdYlBu")
mycols <- colorRampPalette(mypalette)
Now we are ready to plot the heatmap. For heatmap, it is not that good idea to plot all the genes in the dataset since it would make us harder to understand how much the samples are different each other. We will start with the data generated on the preparation step, but we will take only 200 most variable genes.
varM <- apply(dsample, 1, var)
varM1 <- varM[order(varM, decreasing=TRUE)]
varM1 <- varM1[1:200]
dM <- dsample[names(varM1),]
We will use heatmap.2 function on R. This will help you to create heatmap easily with the matrix-format of dataset. Before starting, I would like to recommend checking the parameters to adjust. I will add some information of parameters I have used, but you could always check the parameters by ?heatmap.2.
heatmap.2(as.matrix(dM),
margins = c(4,5),
keysize = 0.7, #Indicating the size of color key
scale ="row", #ALWAYS KEEP THE ROW
cexRow = 0.5, cexCol = 0.7, #size of axis-label
dendrogram = "row",
Rowv = TRUE, Colv = FALSE, #TRUE = Dendrogram order, #FALSE = Input order
col = rev(mycols(50)), #Number of shades
trace = "none") #Data histrogram on heatmap
Here is the result of heatmap and clustering is based on the genes. We could see that the expression patterns are different between samples, as we expected. If you would like to see the values in detail, what we could try is changing the number of shades to higher number.
Okay, so finally we got the heatmap on our hand! However, what if we would like to extract the information included in the heatmap. Still we have ways to extract the data as a matrix format, which you could play on excel!
write.table(dM, "200_most_variable.csv", quote=F, sep="\t", row.names=T, col.names=T)
To extract Z-score plotted on heatmap, the heatmap should be saved into the variable.
hm <- heatmap.2(as.matrix(dM),
margins = c(7,10),
keysize = 0.8, #Indicating the size of color key
scale ="row", #ALWAYS KEEP THE ROW
cexRow = 0.5, cexCol = 0.7, #size of axis-label
dendrogram = "row",
Rowv = TRUE, Colv = FALSE, #TRUE = Dendrogram order, #FALSE = Input order
col = rev(mycols(50)), #Number of shades
trace = "none") #Data histrogram on heatmap
Then you could extract Z-score from the heatmap saved into the variable hm.
write.table(t(hm$carpet), "Z-score.csv", quote=F, sep="\t", row.names=T, col.names=T)
The last one what we could try with RNA-seq data is PCA plot. Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set. It may sound a lot difficult to understand, but it is kind of projection of variances into 2-dimentional plot.
We will work on the dataset which we already generated on preparation step. There are four different libraries required for the PCA plotting.
library(ggplot2)
library(ggfortify)
library(cluster)
library(RColorBrewer)
We could calculate the variances by prcomp function. The data has been already normalized, so we do not have to scale the values in dataset.
pca_tdf <- prcomp(t(dsample), scale.=FALSE)
To generate PCA plot, what we could use is autoplot, since it is the simplest and easiest function we could use for. We could use the calculated PCA values to plot.
This is the basic format of PCA plot. However, we would like to know which samples is which point, we should add the label on the plot.
autoplot(pca_tdf, label = TRUE, label.size = 3)
Well, it seems not so good because the point and label is overlap. What we could try is removing the point, in other words, use the label as point.
autoplot(pca_tdf, label = TRUE, label.size = 3, shape=FALSE)
What if we could calculate clusters on the PCA plot? We could do the K-means clustering on the PCA plot which we have generated. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
To get the K-means clustering, what we could use is kmeans function. Here, we will try several different number of clusters. By adjusting the number of clusters, the color of label is automatically changed by group.
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.2.2
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] cluster_2.1.2 ggfortify_0.4.12 ggplot2_3.3.5 gplots_3.1.1 RColorBrewer_1.1-2 plotrix_3.8-1
loaded via a namespace (and not attached):
[1] bslib_0.2.5.1 compiler_4.1.0 pillar_1.6.1 BiocManager_1.30.16 jquerylib_0.1.4 bitops_1.0-7
[7] tools_4.1.0 digest_0.6.27 jsonlite_1.7.2 evaluate_0.14 lifecycle_1.0.0 tibble_3.1.2
[13] gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.11 DBI_1.1.1 yaml_2.2.1 xfun_0.24
[19] gridExtra_2.3 withr_2.4.2 dplyr_1.0.7 stringr_1.4.0 knitr_1.33 generics_0.1.0
[25] sass_0.4.0 vctrs_0.3.8 gtools_3.9.2 caTools_1.18.2 tidyselect_1.1.1 grid_4.1.0
[31] glue_1.4.2 R6_2.5.0 fansi_0.5.0 rmarkdown_2.9 farver_2.1.0 tidyr_1.1.3
[37] purrr_0.3.4 magrittr_2.0.1 scales_1.1.1 htmltools_0.5.1.1 ellipsis_0.3.2 assertthat_0.2.1
[43] colorspace_2.0-2 labeling_0.4.2 KernSmooth_2.23-20 utf8_1.2.1 stringi_1.7.3 munsell_0.5.0
[49] crayon_1.4.1